class: center, middle, inverse, title-slide

.title[
# Text analysis II: Commending the tm package
]
.subtitle[
## Introduction to R for Social Sciences (Sociology ∞ HERO)
]
.author[
### Josef Ginnerskov, doctoral candidate
]
.institute[
### Department of Sociology
]
.date[
### 2022-06-03
]

---

# Loading character vector documents

```r
# Named character vector: one element per document, named after its author
vectordata <- c(Durkheim = "Sociological method as we practice it rests wholly on the basic principle that social facts must be studied as things, that is, as realities external to the individual. There is no principle for which we have received more criticism; but none is more fundamental. Indubitably for sociology to be possible, it must above all have an object all its own. It must take cognizance of a reality which is not in the domain of other sciences... there can be no sociology unless societies exist, and that societies cannot exist if there are only individuals.",
                Weber = "'Sociology' is a word which is used in many different senses. In the sense adopted here, it means the science whose object is to interpret the meaning of social action and thereby give a causal explanation of the way in which the action proceeds and the effects which it produces. By 'action' in this definition is meant human behaviour when and to the extent that the agent or agents see it as subjectively meaningful: the behaviour may be either internal or external, and may consist in the agent's doing something, omitting to do something, or having something done to him. By 'social' action is meant an action in which the meaning intended by the agent or agents involves a relation to another person's behaviour and in which that relation determines the way in which the action proceeds.",
                Simmel = "I UNDERSTAND the task of sociology to be description and determination of the historico-psychological origin of those forms in which interactions take place between human beings. The totality of these interactions, springing from the most diverse impulses, directed toward the most diverse objects, and aiming at the most diverse ends, constitutes 'society'. Those different contents in connection with which the forms of interaction manifest themselves are the subject-matter of special sciences. These contents attain the character of social facts by virtue of occurring in this particular form in the interactions of men.",
                Tarde = "I will pass over a number of secondary objections which the application of the sociological point of view may encounter along its way. Since, after all, the fundamental nature of things is strictly inaccessible, and we are obliged to construct hypotheses in order to penetrate it, let us openly adopt this one and push it to its conclusion. Hypotheses fingo, I say naively. What is dangerous in the sciences are not tightly linked conjectures, logically followed to the ultimate depths or the ultimate precipices, but rather the ghosts of ideas which float aimlessly in the mind. The universal sociological point of view seems to me to be one of these spectres which haunt the brains of our speculative contemporaries.")
```

---

# Creating a corpus (vector source)

```r
library(tm) # Package for text mining tasks
corpus <- VCorpus(VectorSource(vectordata)) # Create a volatile corpus, kept in memory as an R object
inspect(corpus)
```

```
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 4
##
## $Durkheim
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 549
##
## $Weber
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 793
##
## $Simmel
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 624
##
## $Tarde
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 718
```

---

# Exporting corpus and importing txt

```r
writeCorpus(corpus) # Write each document to a .txt file in the working directory
dircorpus <- VCorpus(DirSource(pattern = "\\.txt$")) # DirSource reads from the working directory by default; pattern is a regular expression
inspect(dircorpus)
```

```
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 4
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 549
##
## [[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 793
##
## [[3]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 624
##
## [[4]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 718
```

---

# Managing corpus metadata

```r
meta(corpus) <- c("Durkheim", "Weber", "Simmel", "Tarde") # No tag is given, so no column is added
meta(corpus) # Corpus-level (indexed) metadata is still empty, see the first output
meta(corpus[[1]], "author") <- "Durkheim" # Document-level metadata, set per document
meta(corpus[[2]], "author") <- "Weber"
meta(corpus[[3]], "author") <- "Simmel"
meta(corpus[[4]], "author") <- "Tarde"
meta(corpus[[3]])
```

```
## data frame with 0 columns and 4 rows
```

```
##   author       : Simmel
##   datetimestamp: 2022-06-03 20:50:04
##   description  : character(0)
##   heading      : character(0)
##   id           : 3
##   language     : en
##   origin       : character(0)
```

---

# Preprocessing - to lower case

```r
cleancorpus <- tm_map(corpus, content_transformer(tolower))
cleancorpus[[3]]$content # What happened to Simmel?
```

```
## [1] "i understand the task of sociology to be description and determination of the historico-psychological origin of those forms in which interactions take place between human beings. the totality of these interactions, springing from the most diverse impulses, directed toward the most diverse objects, and aiming at the most diverse ends, constitutes 'society'. those different contents in connection with which the forms of interaction manifest themselves are the subject-matter of special sciences. these contents attain the character of social facts by virtue of occurring in this particular form in the interactions of men."
```

---

# Preprocessing - remove punctuation

```r
cleancorpus <- tm_map(corpus, removePunctuation)
cleancorpus[[3]]$content # What happened to Simmel?
```

```
## [1] "I UNDERSTAND the task of sociology to be description and determination of the historicopsychological origin of those forms in which interactions take place between human beings The totality of these interactions springing from the most diverse impulses directed toward the most diverse objects and aiming at the most diverse ends constitutes society Those different contents in connection with which the forms of interaction manifest themselves are the subjectmatter of special sciences These contents attain the character of social facts by virtue of occurring in this particular form in the interactions of men"
```

---

# Preprocessing - strip whitespace

```r
cleancorpus <- tm_map(corpus, stripWhitespace)
cleancorpus[[3]]$content # What happened to Simmel?
```

```
## [1] "I UNDERSTAND the task of sociology to be description and determination of the historico-psychological origin of those forms in which interactions take place between human beings. The totality of these interactions, springing from the most diverse impulses, directed toward the most diverse objects, and aiming at the most diverse ends, constitutes 'society'. Those different contents in connection with which the forms of interaction manifest themselves are the subject-matter of special sciences.
These contents attain the character of social facts by virtue of occurring in this particular form in the interactions of men."
```

---

# Preprocessing - managing stopwords (1/2)

```r
stopwords("en") # Print tm's built-in English stopword list
stopcorpus <- tm_map(corpus, removeWords, stopwords("en")) # Case-sensitive: capitalised words such as "The" are kept
mystopwords <- c("something", "can", "must", "since") # Custom stopwords
stopcorpus <- tm_map(stopcorpus, removeWords, mystopwords)
```

```
##   [1] "i"          "me"         "my"         "myself"     "we"
##   [6] "our"        "ours"       "ourselves"  "you"        "your"
##  [11] "yours"      "yourself"   "yourselves" "he"         "him"
##  [16] "his"        "himself"    "she"        "her"        "hers"
##  [21] "herself"    "it"         "its"        "itself"     "they"
##  [26] "them"       "their"      "theirs"     "themselves" "what"
##  [31] "which"      "who"        "whom"       "this"       "that"
##  [36] "these"      "those"      "am"         "is"         "are"
##  [41] "was"        "were"       "be"         "been"       "being"
##  [46] "have"       "has"        "had"        "having"     "do"
##  [51] "does"       "did"        "doing"      "would"      "should"
##  [56] "could"      "ought"      "i'm"        "you're"     "he's"
##  [61] "she's"      "it's"       "we're"      "they're"    "i've"
##  [66] "you've"     "we've"      "they've"    "i'd"        "you'd"
##  [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"
##  [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"
##  [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"
##  [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"
##  [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"
##  [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"
## [101] "who's"      "what's"     "here's"     "there's"    "when's"
## [106] "where's"    "why's"      "how's"      "a"          "an"
## [111] "the"        "and"        "but"        "if"         "or"
## [116] "because"    "as"         "until"      "while"      "of"
## [121] "at"         "by"         "for"        "with"       "about"
## [126] "against"    "between"    "into"       "through"    "during"
## [131] "before"     "after"      "above"      "below"      "to"
## [136] "from"       "up"         "down"       "in"         "out"
## [141] "on"         "off"        "over"       "under"      "again"
## [146] "further"    "then"       "once"       "here"       "there"
## [151] "when"       "where"      "why"        "how"        "all"
## [156] "any"        "both"       "each"       "few"        "more"
## [161] "most"       "other"      "some"       "such"       "no"
## [166] "nor"        "not"        "only"       "own"        "same"
## [171] "so"         "than"       "too"        "very"
```

---

# Preprocessing - managing stopwords (2/2)

```r
stopcorpus[[3]]$content # What happened to Simmel?
```

```
## [1] "I UNDERSTAND task sociology description determination historico-psychological origin forms interactions take place human beings. The totality interactions, springing diverse impulses, directed toward diverse objects, aiming diverse ends, constitutes 'society'. Those different contents connection forms interaction manifest subject-matter special sciences. These contents attain character social facts virtue occurring particular form interactions men."
```

---

# Preprocessing - stemming

```r
library(SnowballC) # Package for stemming
cleancorpus <- tm_map(corpus, stemDocument)
cleancorpus[[3]]$content # What happened to Simmel?
```

```
## [1] "I UNDERSTAND the task of sociolog to be descript and determin of the historico-psycholog origin of those form in which interact take place between human beings. The total of these interactions, spring from the most divers impulses, direct toward the most divers objects, and aim at the most divers ends, constitut society'. Those differ content in connect with which the form of interact manifest themselv are the subject-matt of special sciences. These content attain the charact of social fact by virtu of occur in this particular form in the interact of men."
```

---

# Applying all preprocessing tasks

```r
cleancorpus <- tm_map(corpus, content_transformer(tolower))
cleancorpus <- tm_map(cleancorpus, removePunctuation)
cleancorpus <- tm_map(cleancorpus, removeWords, stopwords("en"))
cleancorpus <- tm_map(cleancorpus, stemDocument)
cleancorpus <- tm_map(cleancorpus, stripWhitespace)
cleancorpus[[3]]$content # What happened to Simmel?
```

```
## [1] "understand task sociolog descript determin historicopsycholog origin form interact take place human be total interact spring divers impuls direct toward divers object aim divers end constitut societi differ content connect form interact manifest subjectmatt special scienc content attain charact social fact virtu occur particular form interact men"
```

---

# Generating a document-term matrix

```r
dtm <- DocumentTermMatrix(cleancorpus) # Generate a dtm from the preprocessed corpus
cleandtm <- DocumentTermMatrix(corpus, # Generate and preprocess a dtm from the original corpus
                               control = list(removePunctuation = TRUE,
                                              stripWhitespace = TRUE,   # not a documented termFreq() control option
                                              removeSparseTerms = 0.99, # likewise; removeSparseTerms() is a separate function for a finished dtm
                                              stopwords = TRUE,
                                              stemming = TRUE))
inspect(cleandtm) # Inspect the dtm
```

```
## <<DocumentTermMatrix (documents: 4, terms: 146)>>
## Non-/sparse entries: 169/415
## Sparsity           : 71%
## Maximal term length: 18
## Weighting          : term frequency (tf)
## Sample             :
##     Terms
## Docs action agent behaviour divers form interact object scienc social sociolog
##    1      0     0         0      0    0        0      1      1      1        3
##    2      6     5         3      0    0        0      1      1      2        1
##    3      0     0         0      3    3        4      1      1      1        1
##    4      0     0         0      0    0        0      1      1      0        2
```

---

# Creating dictionaries

```r
# Count only the terms listed in the dictionary
inspect(DocumentTermMatrix(corpus, list(dictionary = c("sociology", "structure", "action"))))
```

```
## <<DocumentTermMatrix (documents: 4, terms: 3)>>
## Non-/sparse entries: 3/9
## Sparsity           : 75%
## Maximal term length: 9
## Weighting          : term frequency (tf)
## Sample             :
##     Terms
## Docs action sociology structure
##    1      0         2         0
##    2      5         0         0
##    3      0         1         0
##    4      0         0         0
```

---

# Operating a dtm - terms per corpus

```r
findFreqTerms(cleandtm, 4) # Find terms appearing at least 4 times in the corpus
```

```
## [1] "action"   "agent"    "interact" "object"   "scienc"   "social"   "sociolog"
```

---

# Operating a dtm - terms per doc

```r
findMostFreqTerms(cleandtm) # Find the most frequent terms for each document
```

```
## $`1`
##     must sociolog    exist individu principl  realiti
##        3        3        2        2        2        2
##
## $`2`
##    action     agent behaviour      mean    someth       may
##         6         5         3         3         3         2
##
## $`3`
## interact   divers     form  content      aim   attain
##        4        3        3        2        1        1
##
## $`4`
## hypothes      one    point sociolog    ultim     view
##        2        2        2        2        2        2
```

---

# Operating a dtm - term correlations

```r
findAssocs(cleandtm, terms = "social", corlimit = 0.6) # Terms whose counts correlate with a specific word
```

```
## $social
##    action     agent     anoth behaviour    causal   consist   definit      done
##      0.82      0.82      0.82      0.82      0.82      0.82      0.82      0.82
##    effect    either    explan    extent      give    intend    intern interpret
##      0.82      0.82      0.82      0.82      0.82      0.82      0.82      0.82
##    involv      mani      mean   meaning     meant      omit    person   proceed
##      0.82      0.82      0.82      0.82      0.82      0.82      0.82      0.82
##    produc     relat       see      sens    someth   subject   therebi       use
##      0.82      0.82      0.82      0.82      0.82      0.82      0.82      0.82
##     whose      word  determin    differ    extern     human
##      0.82      0.82      0.71      0.71      0.71      0.71
```

---

# BONUS: From dtm to word cloud

```r
library(wordcloud) # Package for generating word clouds
df <- as.matrix(dtm)
df <- sort(colSums(df), decreasing = TRUE) # Corpus-wide frequency per term
df <- data.frame(word = names(df), freq = df)
wordcloud(df$word, df$freq, colors = brewer.pal(8, "Dark2")) # "Dark2" has at most 8 colours
```

---

# BONUS: From dtm to word cloud (viz)

<img src="R-course-ginnerskov-part2-2022_files/figure-html/unnamed-chunk-34-1.png" width="80%" style="display: block; margin: auto;" />

---

# Thank you for your time!

## Do not hesitate to contact me

| | |
|:---------------------------------------------------------------------------------------------|:-------------------------|
| <a href="mailto:josef.ginnerskov@soc.uu.se">.UUred[<i class="fa fa-paper-plane fa-fw"></i>]</a> |josef.ginnerskov@soc.uu.se |
| <a href="http://twitter.com/doeparen">.UUred[<i class="fa fa-twitter fa-fw"></i>]</a> |@doeparen |
| <a href="http://github.com/doeparen">.UUred[<i class="fa fa-gitlab fa-fw"></i>]</a> |@doeparen |
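
---

# BONUS: What the dtm operations compute

The numbers returned by `findFreqTerms()` and `findAssocs()` can be checked by hand. The block below is a minimal base-R sketch, not how tm works internally. The matrix `m` is a hypothetical stand-in that copies four term columns from the `inspect(cleandtm)` sample shown earlier. It assumes that frequent terms are selected on column sums and that associations are Pearson correlations between term columns.

```r
# Hypothetical stand-in for cleandtm: counts copied from the inspect(cleandtm) sample
# (rows = documents 1 to 4, columns = stemmed terms)
m <- cbind(
  action   = c(0, 6, 0, 0),
  agent    = c(0, 5, 0, 0),
  social   = c(1, 2, 1, 0),
  sociolog = c(3, 1, 1, 2)
)
rownames(m) <- c("1", "2", "3", "4")

# Like findFreqTerms(cleandtm, 4): keep terms whose corpus-wide count is at least 4
freq_terms <- colnames(m)[colSums(m) >= 4]
print(freq_terms) # "action" "agent" "social" "sociolog"

# Like findAssocs(): correlation between the count columns of two terms
assoc <- round(cor(m[, "social"], m[, "action"]), 2)
print(assoc) # 0.82
```

With only four documents, any term that occurs solely in Weber's text has the same count pattern across documents as "action", which is why so many terms share the value 0.82 in the `findAssocs()` output.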